Abstract: Let $(X, Y)$ be a random variable consisting of an observed feature vector $X \in \mathcal{X}$ and an unobserved class label $Y \in \{1, 2, \ldots, L\}$ with unknown joint distribution. In addition, let $\mathcal{D}$ be a training data set consisting of $n$ completely observed independent copies of $(X, Y)$. Usual classification procedures provide point predictors (classifiers) $\hat{Y}(X, \mathcal{D})$ of $Y$ or estimate the conditional distribution of $Y$ given $X$. In order to quantify the certainty of classifying $X$ we propose to construct for each $\theta = 1, 2, \ldots, L$ a p-value $\pi_\theta(X, \mathcal{D})$ for the null hypothesis that $Y = \theta$, treating $Y$ temporarily as a fixed parameter. In other words, the point predictor $\hat{Y}(X, \mathcal{D})$ is replaced with a prediction region for $Y$ with a certain confidence. We argue that (i) this approach is advantageous over traditional approaches and (ii) any reasonable classifier can be modified to yield nonparametric p-values. We discuss issues such as optimality, single use and multiple use validity, as well as computational and graphical aspects.
1. Introduction

Let $(X, Y)$ be a random variable consisting of a feature vector $X \in \mathcal{X}$ and a class label $Y \in \Theta := \{1, \ldots, L\}$ with $L \ge 2$ possible values. The joint distribution of $X$ and $Y$ is determined by the prior probabilities $w_\theta := \mathbb{P}(Y = \theta)$ and the conditional distributions $P_\theta := \mathcal{L}(X \mid Y = \theta)$ for all $\theta \in \Theta$. Classifying such an observation $(X, Y)$ means that only $X$ is observed, while $Y$ has to be predicted somehow. There is a vast literature on classification, and we refer to McLachlan [7], Ripley [10] or Fraley and Raftery [4] for an introduction and further references.

Let us assume for the moment that the joint distribution of $X$ and $Y$ is known, so that training data are not needed yet. In the simplest case, one chooses a classifier $\hat{Y} : \mathcal{X} \to \Theta$, i.e. a point predictor of $Y$. A possible extension is to consider $\hat{Y} : \mathcal{X} \to \{0\} \cup \Theta$, where $\hat{Y}(X) = 0$ means that no class is viewed as plausible. A Bayesian approach would be to calculate the posterior distribution of $Y$ given $X$, i.e. the posterior weights $w_\theta(X) := \mathbb{P}(Y = \theta \mid X)$. In fact, a classifier $\hat{Y}^*$ satisfying

$$\hat{Y}^*(X) \in \arg\max_{\theta \in \Theta} w_\theta(X)$$

is well-known [7, Chapter 1] to minimize the risk

$$R(\hat{Y}) := \mathbb{P}(\hat{Y}(X) \ne Y).$$

An obvious advantage of using the posterior distribution instead of the simple classifier $\hat{Y}^*$ (or $\hat{Y}$) is additional information about confidence. That means, for instance, the possibility of computing the conditional risk $\mathbb{P}(\hat{Y}^*(X) \ne Y \mid X) = 1 - \max_{\theta \in \Theta} w_\theta(X)$. However, this depends very sensitively on the prior weights $w_\theta$. Small changes in the latter may result in drastic changes of the posterior weights $w_\theta(X)$. Moreover, if some classes $\theta$ have very small prior weight, the classifier $\hat{Y}^*$ tends to ignore these, i.e. the class-dependent risk $\mathbb{P}(\hat{Y}^*(X) \ne Y \mid Y = \theta)$ may be rather large for some classes $\theta$. For instance, in medical applications each class may correspond to a certain disease status while the feature vector contains information about patients, including certain symptoms. Here it would be unacceptable to classify each person as being healthy, just because the diseases in question are extremely rare. Note also that some study designs (e.g. case-control studies) allow for the estimation of the $P_\theta$ but not the $w_\theta$. Moreover, there are applications in which the $w_\theta$ change over time while it is still plausible to assume fixed conditional distributions $P_\theta$.

Another drawback of the posterior probabilities $w_\theta(X)$ is the following: Suppose that the prior weights $w_\theta$ are all identical and that for some subset $\Theta_o$ of $\Theta$ with at least two elements the conditional distributions $P_\theta$, $\theta \in \Theta_o$, are very similar. Then the posterior distribution of $Y$ given $X$ divides the mass corresponding to $\Theta_o$ essentially uniformly among its elements. Even if the point $X$ is right in the 'center' of the distributions $P_\theta$, $\theta \in \Theta_o$, so that each class in $\Theta_o$ is perfectly plausible, the posterior weights are not greater than $1/\#\Theta_o$. If $w_\theta(X)$ is viewed merely as a measure of plausibility of class $\theta$, there is no compelling reason why these measures should add to one.

To treat all classes impartially, we propose to compute for each class $\theta$ a p-value $\pi_\theta(X)$ of the null hypothesis that $Y = \theta$. (In this formulation we treat $Y$ temporarily as an unknown fixed parameter.) That means, $\pi_\theta : \mathcal{X} \to [0, 1]$ satisfies

$$\mathbb{P}(\pi_\theta(X) \le \alpha \mid Y = \theta) \le \alpha \quad\text{for all } \alpha \in (0, 1). \tag{1.1}$$

Given such p-values $\pi_\theta$, the set

$$\hat{\mathcal{Y}}_\alpha(X) := \{\theta \in \Theta : \pi_\theta(X) > \alpha\}$$

is a $(1 - \alpha)$-prediction region for $Y$, i.e.

$$\mathbb{P}(Y \in \hat{\mathcal{Y}}_\alpha(X) \mid Y = \theta) \ge 1 - \alpha \quad\text{for arbitrary } \theta \in \Theta,\ \alpha \in (0, 1).$$

If $\hat{\mathcal{Y}}_\alpha(X)$ happens to be a singleton, we have classified $X$ uniquely with given confidence $1 - \alpha$. In case of $2 \le \#\hat{\mathcal{Y}}_\alpha(X) < L$ we can at least exclude some classes with a certain confidence.
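The step from p-values to a prediction region can be sketched in a few lines. This is only an illustration: the function name `prediction_region`, the dictionary representation of the p-values and the numerical values are assumptions made for this sketch, not part of the paper.

```python
def prediction_region(p_values, alpha):
    """Return the set of class labels whose p-value strictly exceeds alpha.

    p_values maps each class label theta to a (hypothetical) p-value pi_theta(x).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return {theta for theta, p in p_values.items() if p > alpha}


# Hypothetical p-values for L = 3 classes at a single feature vector x.
p_values = {1: 0.62, 2: 0.03, 3: 0.008}
region_05 = prediction_region(p_values, 0.05)  # singleton: unique classification
region_01 = prediction_region(p_values, 0.01)  # two labels: class 3 is excluded
```

Lowering the level from 5% to 1% can only enlarge the region, which mirrors the trade-off between confidence and unambiguous classification described above.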

So far the classification problem corresponds to a simple statistical model with finite parameter space $\Theta$. A distinguishing feature of classification problems is that the joint distribution of $(X, Y)$ is typically unknown and has to be estimated from a set $\mathcal{D}$ consisting of completely observed training observations $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$. Let us assume for the moment that all $n + 1$ observations, i.e. the $n$ training observations $(X_i, Y_i)$ and the current observation $(X, Y)$, are independent and identically distributed. Now one has to consider classifiers $\hat{Y}(X, \mathcal{D})$ and p-values $\pi_\theta(X, \mathcal{D})$ depending on the current feature vector $X$ as well as on the training data $\mathcal{D}$. In this situation one can think of two possible extensions of (1.1): For any $\theta \in \Theta$ and $\alpha \in (0, 1)$,

$$\mathbb{P}(\pi_\theta(X, \mathcal{D}) \le \alpha \mid Y = \theta) \le \alpha, \tag{1.2}$$
$$\mathbb{P}(\pi_\theta(X, \mathcal{D}) \le \alpha \mid Y = \theta, \mathcal{D}) \le \alpha + o_p(1) \quad\text{as } n \to \infty. \tag{1.3}$$

It will turn out that Condition (1.2) can be guaranteed in various settings. Condition (1.3) corresponds to "multiple use" of our p-values: Suppose that we use the training data $\mathcal{D}$ to construct the p-values $\pi_\theta(\cdot, \mathcal{D})$ and classify many future observations $(X, Y)$. Then the relative number of future observations with $Y = b$ and $\pi_\theta(X, \mathcal{D}) \le \alpha$ is close to

$$w_b \cdot \mathbb{P}(\pi_\theta(X, \mathcal{D}) \le \alpha \mid Y = b, \mathcal{D}),$$

a random quantity depending on the training data $\mathcal{D}$.

P-values as discussed here have been used in some special cases before. For instance, McLachlan's [7] "typicality indices" are just p-values $\pi_\theta(X, \mathcal{D})$ satisfying (1.2) in the special case of multivariate gaussian distributions $P_\theta$; see also Section 3. However, McLachlan's p-values are used primarily to identify observations not belonging to any of the given classes in $\Theta$. In particular, they are not designed and optimized for distinguishing between classes within $\Theta$.
Also the use of receiver operating characteristic (ROC) curves in the context 
of logistic regression or Fisher's [■!] linear discriminant analysis is related to the 
present concept. One purpose of this paper is to provide a solid foundation for 
procedures of this type. 

The remainder of this paper is organized as follows: In Section 2 we return to the idealistic situation of known prior weights $w_\theta$ and distributions $P_\theta$. Here we devise p-values that are optimal in a certain sense and related to the optimal classifier mentioned previously. These p-values serve as a gold standard for p-values in realistic settings. In addition we describe briefly McLachlan's [7] typicality indices and a potential compromise between these p-values and the optimal ones.

Section 3 is devoted to p-values involving training data. After some general remarks on cross-validation and graphical representations, we discuss McLachlan's [7] p-values in view of (1.2) and (1.3). Nonparametric p-values satisfying (1.2) without any further assumptions on the distributions $P_\theta$ are proposed in Section 3.3. These p-values are based on permutation testing, and the only practical restriction is that the group sizes $N_\theta := \#\{i : Y_i = \theta\}$ within the training data should exceed the reciprocal of the intended test level $\alpha$. We claim that any reasonable classification method can be converted to yield p-values. In particular, we introduce p-values based on a suitable variant of the nearest-neighbor method. Section 3.4 deals with asymptotic properties of various p-values as the size $n$ of $\mathcal{D}$ tends to infinity. It is shown in particular that under mild regularity conditions the nearest-neighbor p-values are asymptotically equivalent to the optimal methods of Section 2. These results are analogous to results of Stone [12, Section 8] for nearest-neighbor classifiers. In Section 3.5 the nonparametric p-values are illustrated with simulated and real data. Finally, in Section 3.6 we comment on Condition (1.3) and show that the $o_p(1)$ cannot be avoided in general.

In Section 4 we comment briefly on computational aspects of our methods. Section 5 introduces the notion of 'local identifiability' for finite mixtures, which is of independent interest. For us it is helpful to define the optimal p-values in a simple manner and it is also useful for the asymptotic considerations in Section 3.4. Proofs and technical arguments are deferred to Section 6.

Let us mention a different type of confidence procedure for classification: Suppose that $[a_\theta(X, \mathcal{D}), b_\theta(X, \mathcal{D})]$ is a confidence interval for $w_\theta(X)$. Precisely, let $a_\theta(X, \mathcal{D}) \le w_\theta(X) \le b_\theta(X, \mathcal{D})$ for all $\theta \in \Theta$ with probability at least $1 - \alpha$. Then

$$\hat{\mathcal{Y}}(X, \mathcal{D}) := \bigl\{\theta \in \Theta : b_\theta(X, \mathcal{D}) \ge \max_{c \in \Theta} a_c(X, \mathcal{D})\bigr\}$$

would be a prediction region for $Y$ such that $\hat{Y}^*(X) \in \hat{\mathcal{Y}}(X, \mathcal{D})$ with probability at least $1 - \alpha$. Note, however, that this gives no control over the probability that $Y \notin \hat{\mathcal{Y}}(X, \mathcal{D})$. In fact, the latter probability could be close to 50 percent. By way of contrast, with the p-values in the present paper we can guarantee to cover $Y$ with a certain confidence, even in situations where consistent estimation of the conditional probabilities $w_\theta(X)$ is difficult or even impossible.

2. Optimal p-values and alternatives 

Suppose that the distributions $P_1, \ldots, P_L$ have known densities $f_1, \ldots, f_L > 0$ with respect to some measure $M$ on $\mathcal{X}$. Then the marginal distribution of $X$ has density $f := \sum_{b \in \Theta} w_b f_b$ with respect to $M$, and

$$w_\theta(x) = \frac{w_\theta f_\theta(x)}{f(x)}.$$

Hence the optimal classifier $\hat{Y}^*$ may be characterized by

$$\hat{Y}^*(X) \in \arg\max_{\theta \in \Theta} w_\theta f_\theta(X).$$

L. Dümbgen et al./P-values for classification


2.1. Optimal p-values 

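Here is an analogous consideration for p-values. Let $\pi = (\pi_\theta)_{\theta \in \Theta}$ consist of p-values $\pi_\theta$ satisfying (1.1). Given the latter constraint, our goal is to provide small p-values and small prediction regions. Hence two natural measures of risk are, for instance,

$$\mathcal{R}(\pi) := \mathbb{E} \sum_{\theta \in \Theta} \pi_\theta(X) \quad\text{or}\quad \mathcal{R}_\alpha(\pi) := \mathbb{E}\, \#\hat{\mathcal{Y}}_\alpha(X).$$

Elementary calculations reveal that

$$\mathcal{R}(\pi) = \int_0^1 \mathcal{R}_\alpha(\pi)\, d\alpha \quad\text{and}\quad \mathcal{R}_\alpha(\pi) = \sum_{\theta \in \Theta} \mathcal{R}_\alpha(\pi_\theta)$$

with

$$\mathcal{R}_\alpha(\pi_\theta) := \mathbb{P}(\pi_\theta(X) > \alpha).$$

Thus we focus on minimizing $\mathcal{R}_\alpha(\pi_\theta)$ for arbitrary fixed $\theta \in \Theta$ and $\alpha \in (0, 1)$ under the constraint (1.1). Since $x \mapsto 1\{\pi_\theta(x) > \alpha\}$ may be viewed as a level-$\alpha$ test of $P_\theta$ versus $\sum_{b \in \Theta} w_b P_b$, a straightforward application of the Neyman-Pearson Lemma shows that the p-value

$$\pi^*_\theta(x) := P_\theta\{z \in \mathcal{X} : (f_\theta/f)(z) \le (f_\theta/f)(x)\}$$

is optimal, provided that the distribution $\mathcal{L}\bigl((f_\theta/f)(X)\bigr)$ is continuous. Two other representations of $\pi^*_\theta$ are given by

$$\pi^*_\theta(x) = P_\theta\{z \in \mathcal{X} : w_\theta(z) \le w_\theta(x)\} = P_\theta\{z \in \mathcal{X} : T^*_\theta(z) \ge T^*_\theta(x)\}$$

with $T^*_\theta := \sum_{b \ne \theta} w_{b,\theta} f_b / f_\theta$ and $w_{b,\theta} := w_b / \sum_{c \ne \theta} w_c$. The former representation shows that $\pi^*_\theta(x)$ is a non-decreasing function of $w_\theta(x)$. The latter representation shows that the prior weight $w_\theta$ itself is irrelevant for the optimal p-value $\pi^*_\theta(x)$; only the ratios $w_c/w_b$ with $b, c \ne \theta$ matter. In particular, in case of $L = 2$ classes, the optimal p-values do not depend on the prior distribution of $Y$ at all.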

Here and throughout this paper we assume the likelihood ratios $T^*_\theta(X)$ to have a continuous distribution. It will be shown in Section 5 that many standard families of distributions fulfill this condition. In particular, it is satisfied in case of $\mathcal{X} = \mathbb{R}^q$ and $P_\theta = \mathcal{N}_q(\mu_\theta, \Sigma_\theta)$ with parameters $(\mu_\theta, \Sigma_\theta)$, $\Sigma_\theta$ nonsingular, not all being identical. Further examples include the multivariate t-family as it has been advocated by Peel and McLachlan [8] to robustify cluster and discriminant analysis. These authors also discuss maximum likelihood via the EM algorithm in this model. Without the continuity condition on $\mathcal{L}(T^*_\theta(X))$ one could still devise optimal p-values by introducing randomized p-values, but we refrain from such extensions.

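Let us illustrate the optimal p-values in two examples involving normal distributions:

Example 2.1. (Standard model) Let $P_\theta = \mathcal{N}_q(\mu_\theta, \Sigma)$ with mean vectors $\mu_\theta \in \mathbb{R}^q$ and a common symmetric, nonsingular covariance matrix $\Sigma \in \mathbb{R}^{q \times q}$. Then

$$T^*_\theta(x) = \sum_{b \ne \theta} w_{b,\theta} \exp\bigl((x - \mu_{\theta,b})^\top \Sigma^{-1} (\mu_b - \mu_\theta)\bigr) \tag{2.1}$$

with $\mu_{\theta,b} := 2^{-1}(\mu_\theta + \mu_b)$. In the special case of $L = 2$ classes, let $Z(x) := (x - \mu_{1,2})^\top \Sigma^{-1} (\mu_2 - \mu_1) / \|\mu_1 - \mu_2\|_\Sigma$ with the Mahalanobis norm $\|v\|_\Sigma := (v^\top \Sigma^{-1} v)^{1/2}$. Then elementary calculations show that

$$\pi^*_1(x) = \Phi\bigl(-Z(x) - \|\mu_1 - \mu_2\|_\Sigma/2\bigr),$$
$$\pi^*_2(x) = \Phi\bigl(+Z(x) - \|\mu_1 - \mu_2\|_\Sigma/2\bigr),$$

where $\Phi$ denotes the standard gaussian c.d.f. In case of $\|\mu_1 - \mu_2\|_\Sigma/2 \ge \Phi^{-1}(1 - \alpha)$,

$$\hat{\mathcal{Y}}_\alpha(x) = \begin{cases} \{1\} & \text{if } Z(x) < -\|\mu_1 - \mu_2\|_\Sigma/2 + \Phi^{-1}(1 - \alpha), \\ \{2\} & \text{if } Z(x) > \|\mu_1 - \mu_2\|_\Sigma/2 - \Phi^{-1}(1 - \alpha), \\ \emptyset & \text{else.} \end{cases}$$

Thus the two classes are separated well so that any observation $X$ is classified uniquely (or viewed as suspicious) with confidence $1 - \alpha$. In case of $\|\mu_1 - \mu_2\|_\Sigma/2 < \Phi^{-1}(1 - \alpha)$, the feature space contains regions with unique prediction and a region in which both class labels are plausible:

$$\hat{\mathcal{Y}}_\alpha(x) = \begin{cases} \{1\} & \text{if } Z(x) \le \|\mu_1 - \mu_2\|_\Sigma/2 - \Phi^{-1}(1 - \alpha), \\ \{2\} & \text{if } Z(x) \ge -\|\mu_1 - \mu_2\|_\Sigma/2 + \Phi^{-1}(1 - \alpha), \\ \{1, 2\} & \text{else.} \end{cases}$$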
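The closed-form p-values of Example 2.1 for $L = 2$ can be checked numerically. The following is a minimal sketch restricted to univariate features (so that $\Sigma = \sigma^2$ and the Mahalanobis norm reduces to $|\mu_1 - \mu_2|/\sigma$); the function names and the parameter values in the usage lines are assumptions chosen for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard gaussian; .cdf plays the role of Phi


def optimal_p_values(x, mu1, mu2, sigma):
    """Closed-form optimal p-values (pi*_1, pi*_2) for N(mu1, sigma^2) versus N(mu2, sigma^2)."""
    if sigma <= 0 or mu1 == mu2:
        raise ValueError("need sigma > 0 and distinct means")
    delta = abs(mu1 - mu2) / sigma  # Mahalanobis distance between the two means
    z = (x - 0.5 * (mu1 + mu2)) * (mu2 - mu1) / (sigma ** 2 * delta)  # Z(x)
    return STD_NORMAL.cdf(-z - delta / 2), STD_NORMAL.cdf(z - delta / 2)


def gaussian_prediction_region(x, mu1, mu2, sigma, alpha):
    """Prediction region {theta : pi*_theta(x) > alpha} for the two-class model."""
    p1, p2 = optimal_p_values(x, mu1, mu2, sigma)
    return {label for label, p in ((1, p1), (2, p2)) if p > alpha}


# Well separated classes (distance 4): unique labels or the empty set.
well_separated = [gaussian_prediction_region(x, 0.0, 4.0, 1.0, 0.05) for x in (0.0, 2.0, 4.0)]
# Overlapping classes (distance 1): the midpoint is compatible with both labels.
overlapping = gaussian_prediction_region(0.5, 0.0, 1.0, 1.0, 0.05)
```

In the well separated case the midpoint between the means yields the empty set, i.e. the observation is flagged as suspicious, whereas in the overlapping case the midpoint yields both labels.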

Example 2.2. Consider $L = 3$ classes with equal prior weights $w_\theta = 1/3$ and bivariate normal distributions $P_\theta = \mathcal{N}_2(\mu_\theta, \Sigma_\theta)$, where

$$\mu_1 = (-1, 1)^\top, \quad \mu_2 = (-1, -1)^\top, \quad \mu_3 = (2, 0)^\top$$

and

E -E - ( ^ E - (^-"^ M 







Fig 1. P-value functions $\pi^*_1$ (top left), $\pi^*_2$ (bottom left), $\pi^*_3$ (top right) and a typical data set (bottom right) for Example 2.2.

Figure 1 shows a typical sample from this distribution and the corresponding p-value functions $\pi^*_\theta$. The latter are on a grey scale with white corresponding to zero and black corresponding to one. The resulting prediction regions $\hat{\mathcal{Y}}_\alpha(x)$ for $\alpha = 5\%$ and $\alpha = 1\%$ are depicted in Figure 2. In the latter plots, the color of a point $x \in \mathbb{R}^2$ has the following meaning:

(The configuration $\hat{\mathcal{Y}}_\alpha(x) = \{1, 3\}$ never appeared.) Note the influence of $\alpha$: On the one hand, $\hat{\mathcal{Y}}_{0.05}(x) = \emptyset$ for some $x \in \mathbb{R}^2$ but $\hat{\mathcal{Y}}_{0.05}(\cdot) \ne \{1, 2, 3\}$ in the depicted rectangle. On the other hand, $\hat{\mathcal{Y}}_{0.01}(x) = \{1, 2, 3\}$ for some $x \in \mathbb{R}^2$ while $\hat{\mathcal{Y}}_{0.01}(\cdot) \ne \emptyset$.
2.2. Typicality indices

An alternative definition of p-values is based on the densities themselves, namely,

$$\tau_\theta(x) := P_\theta\{z \in \mathcal{X} : f_\theta(z) \le f_\theta(x)\}.$$

These typicality indices quantify to what extent a point $x$ is an outlier with respect to the single distributions $P_\theta$. These p-values $\tau_\theta$ are certainly suboptimal in terms of the risk $\mathcal{R}_\alpha(\tau_\theta)$. On the other hand, they allow for the detection of observations which belong to none of the classes under consideration.

Example 2.3. Again let $\mathcal{X} = \mathbb{R}^q$ and $P_\theta = \mathcal{N}_q(\mu_\theta, \Sigma_\theta)$. Since $f_\theta(X)$ is a strictly decreasing function of $\|X - \mu_\theta\|^2_{\Sigma_\theta}$ with conditional distribution $\chi^2_q$ given $Y = \theta$, the typicality indices may be expressed as

$$\tau_\theta(x) = 1 - F_q\bigl(\|x - \mu_\theta\|^2_{\Sigma_\theta}\bigr),$$

where $F_q$ denotes the c.d.f. of $\chi^2_q$. These p-values allow for the separation of two different classes $\theta, b \in \Theta$ only if

is sufficiently large. Thus they suffer from the curse of dimensionality and may yield much more conservative prediction regions than the p-values $\pi^*_\theta$.

2.3. Combining the optimal p-values and typicality indices 

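The optimal p-values $\pi^*_\theta$ and the typicality indices $\tau_\theta$ may be viewed as extremal members of a whole family of p-values if we introduce an additional class label 0 with 'density' $f_0 \equiv 1$ and prior weight $w_0 > 0$. Then we define the compromise p-value

$$\tilde{\pi}_\theta(x) := P_\theta\{z \in \mathcal{X} : (f_\theta/\tilde{f})(z) \le (f_\theta/\tilde{f})(x)\}$$

with $\tilde{f} := \sum_{b=0}^{L} w_b f_b = f + w_0$. Note that $\tilde{\pi}_\theta \to \tau_\theta$ pointwise as $w_0 \to \infty$, whereas $\tilde{\pi}_\theta \to \pi^*_\theta$ as $w_0 \to 0$.

Example 2.4. In the setting of Example 2.1 there is another modification which is similar in spirit to Ehm et al. [1]: When defining the p-value for a particular class $\theta$ we replace the other distributions $P_b = \mathcal{N}_q(\mu_b, \Sigma)$, $b \ne \theta$, with $\tilde{P}_b = \mathcal{N}_q(\mu_b, c\Sigma)$ for some constant $c > 1$. Thus our modified p-value becomes

$$\tilde{\pi}_\theta(x) := P_\theta\{z \in \mathcal{X} : \tilde{T}_\theta(z) \ge \tilde{T}_\theta(x)\},$$

where

$$\tilde{T}_\theta(x) = \sum_{b \ne \theta} w_{b,\theta} \exp\bigl(\|x - \mu_\theta\|^2_\Sigma/2 - \|x - \mu_b\|^2_\Sigma/(2c)\bigr)$$
$$\phantom{\tilde{T}_\theta(x)} = \sum_{b \ne \theta} w_{b,\theta} \exp\bigl((1 - c^{-1})\|x - \nu_{\theta,b}\|^2_\Sigma/2 - (c - 1)^{-1}\|\mu_b - \mu_\theta\|^2_\Sigma/2\bigr)$$

with $\nu_{\theta,b} := \mu_\theta - (c - 1)^{-1}(\mu_b - \mu_\theta)$.
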
3. Training data 

Now we return to the realistic situation of unknown distributions $P_\theta$ and p-values $\pi_\theta(X, \mathcal{D})$ with corresponding prediction regions $\hat{\mathcal{Y}}_\alpha(X, \mathcal{D})$. From now on we consider the class labels $Y_1, Y_2, \ldots, Y_n$ as fixed while $X_1, X_2, \ldots, X_n$ and $(X, Y)$ are independent with $\mathcal{L}(X_i) = P_{Y_i}$. That way we can cover the case of i.i.d. training data (via conditioning) as well as situations with stratified training samples. In what follows let

$$\mathcal{G}_\theta := \{i \in \{1, \ldots, n\} : Y_i = \theta\} \quad\text{and}\quad N_\theta := \#\mathcal{G}_\theta.$$

We shall tacitly assume that all group sizes $N_\theta$ are strictly positive, and asymptotic statements as in (1.3) are meant as

$$n \to \infty \quad\text{and}\quad N_b/n \to w_b \quad\text{for all } b \in \Theta. \tag{3.1}$$

3.1. Visual assessment and estimation of separability 

Before giving explicit examples of p-values, let us describe our way of visualizing the separability of different classes by means of given p-values $\pi_\theta(\cdot, \cdot)$. For that purpose we propose to compute cross-validated p-values

$$\pi_\theta(X_i, \mathcal{D}_i)$$

for $i = 1, 2, \ldots, n$ with $\mathcal{D}_i$ denoting the training data without observation $(X_i, Y_i)$. Thus each training observation $(X_i, Y_i)$ is treated temporarily as a 'future' observation to be classified with the remaining data $\mathcal{D}_i$. Then we display these cross-validated p-values graphically. This is particularly helpful for training samples of small or moderate size.

In addition to graphical displays one can compute the empirical conditional inclusion probabilities

$$\hat{\mathcal{I}}_\alpha(b, \theta) := \#\{i \in \mathcal{G}_b : \theta \in \hat{\mathcal{Y}}_\alpha(X_i, \mathcal{D}_i)\} / N_b$$

and the empirical pattern probabilities

$$\hat{\mathcal{P}}_\alpha(b, S) := \#\{i \in \mathcal{G}_b : \hat{\mathcal{Y}}_\alpha(X_i, \mathcal{D}_i) = S\} / N_b$$

for $b, \theta \in \Theta$ and $S \subset \Theta$. These numbers $\hat{\mathcal{I}}_\alpha(b, \theta)$ and $\hat{\mathcal{P}}_\alpha(b, S)$ can be interpreted as estimators of

$$\mathcal{I}_\alpha(b, \theta \mid \mathcal{D}) := \mathbb{P}(\theta \in \hat{\mathcal{Y}}_\alpha(X, \mathcal{D}) \mid Y = b, \mathcal{D})$$

and

$$\mathcal{P}_\alpha(b, S \mid \mathcal{D}) := \mathbb{P}(\hat{\mathcal{Y}}_\alpha(X, \mathcal{D}) = S \mid Y = b, \mathcal{D}),$$

respectively; see also Section 3.4.
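Given cross-validated p-values, the empirical inclusion and pattern probabilities defined above are simple counts. The sketch below assumes the cross-validated p-values have already been computed and are stored as one dictionary per training observation; the function name `empirical_probabilities` and the toy numbers are illustrative assumptions.

```python
from collections import Counter


def empirical_probabilities(labels, cv_p_values, alpha):
    """Empirical inclusion and pattern probabilities from cross-validated p-values.

    labels[i] is Y_i; cv_p_values[i] maps each class theta to pi_theta(X_i, D_i).
    Returns two dicts keyed by (b, theta) and (b, frozenset S), respectively.
    """
    classes = sorted(set(labels))
    group_size = Counter(labels)  # N_b for each class b
    inclusion_counts = Counter()
    pattern_counts = Counter()
    for y, p in zip(labels, cv_p_values):
        region = frozenset(theta for theta in classes if p[theta] > alpha)
        pattern_counts[(y, region)] += 1
        for theta in region:
            inclusion_counts[(y, theta)] += 1
    inclusion = {(b, t): inclusion_counts[(b, t)] / group_size[b] for b in classes for t in classes}
    pattern = {key: count / group_size[key[0]] for key, count in pattern_counts.items()}
    return inclusion, pattern


# Toy input: four training observations, two classes, hypothetical p-values.
labels = [1, 1, 2, 2]
cv_p_values = [{1: 0.8, 2: 0.02}, {1: 0.3, 2: 0.4}, {1: 0.01, 2: 0.9}, {1: 0.2, 2: 0.6}]
inclusion, pattern = empirical_probabilities(labels, cv_p_values, alpha=0.05)
```

Evaluating `inclusion[(b, theta)]` over a grid of levels `alpha` gives the points of an empirical ROC curve for the pair $(b, \theta)$.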

For large group sizes $N_b$, one can also display the empirical ROC curves

$$(0, 1) \ni \alpha \mapsto 1 - \hat{\mathcal{I}}_\alpha(b, \theta),$$

which are closely related to the usual ROC curves employed, for instance, in logistic regression or linear discriminant analysis involving $L = 2$ classes.

3.2. Typicality indices 

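For the sake of simplicity, suppose that $P_\theta = \mathcal{N}_q(\mu_\theta, \Sigma)$ with unknown mean vectors $\mu_1, \ldots, \mu_L \in \mathbb{R}^q$ and an unknown nonsingular covariance matrix $\Sigma \in \mathbb{R}^{q \times q}$. Consider the standard estimators

$$\hat{\mu}_\theta := N_\theta^{-1} \sum_{i \in \mathcal{G}_\theta} X_i \quad\text{and}\quad \hat{\Sigma} := (n - L)^{-1} \sum_{\theta \in \Theta} \sum_{i \in \mathcal{G}_\theta} (X_i - \hat{\mu}_\theta)(X_i - \hat{\mu}_\theta)^\top.$$

Then the squared Mahalanobis distance

$$T_\theta(X, \mathcal{D}) := \|X - \hat{\mu}_\theta\|^2_{\hat{\Sigma}}$$

can be used to assess the plausibility of class $\theta$, where we assume that $n \ge L + q$. Precisely,

$$C_\theta := \frac{n - L - q + 1}{q\,(n - L)\,(1 + N_\theta^{-1})}$$

is a normalizing constant such that

$$C_\theta T_\theta(X, \mathcal{D}) \sim F_{q,\,n-L-q+1} \quad\text{given } Y = \theta;$$

see [7]. Here $F_{k,z}$ denotes the F-distribution with $k$ and $z$ degrees of freedom, and we use the same symbol for the corresponding c.d.f. Hence the typicality index

$$\tau_\theta(X, \mathcal{D}) := 1 - F_{q,\,n-L-q+1}\bigl(C_\theta T_\theta(X, \mathcal{D})\bigr)$$

is a p-value satisfying (1.2). Moreover, since the estimators $\hat{\mu}_b$ and $\hat{\Sigma}$ are consistent, one can easily verify property (1.3) as well.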

Example 3.1. An array of ten electrochemical sensors is used for "smelling" different substances. In each case it produces raw data $\tilde{X} \in \mathbb{R}^{10}$ consisting of the electrical resistances of these sensors. Before analyzing such data one should standardize them in order to achieve invariance with respect to the substance's concentration. One possible standardization is to replace $\tilde{X}$ with

$$X := \Bigl(\tilde{X}(j) \Big/ \sum_{k=1}^{10} \tilde{X}(k)\Bigr)_{j=1}^{9}.$$

Thus we end up with data vectors in $\mathbb{R}^9$. For technical reasons, group sizes $N_\theta$ are typically small, and not too many future observations may be analysed. This is due to the fact that the system needs to be recalibrated regularly.

Now we consider a specific dataset with "smells" of $L = 12$ different brands of tobacco and fixed group sizes $N_\theta = 3$ for all $\theta \in \Theta$. We computed the cross-validated typicality indices $\tau_\theta(X_i, \mathcal{D}_i)$ described above. Figure 3 depicts for each training observation $(X_i, Y_i)$ the p-values $\tau_1(X_i, \mathcal{D}_i), \ldots, \tau_{12}(X_i, \mathcal{D}_i)$ as a row of twelve rectangles. The area of these is proportional to the corresponding p-value. The first three rows correspond to data from the first brand, the next three rows to the second brand, and so on. Figure 4 displays the corresponding prediction regions $\hat{\mathcal{Y}}_\alpha(X_i, \mathcal{D}_i)$ for $\alpha = 0.01$. Within each row the elements of $\hat{\mathcal{Y}}_\alpha(X_i, \mathcal{D}_i)$ are indicated by rectangles of full size. These pictures show that classes 1 and 2 are separated well from the other eleven classes. Classes 5, 8, 9 and 12 overlap somewhat but are clearly separated from the remaining eight classes. Finally there are three pairs of classes which are essentially impossible to distinguish, at least with the present method, but which are separated well from the other ten classes. These pairs are 3-4, 6-7, and 10-11. It turned out later that brands 6 and 7 were in fact identical. Note also that all except one prediction region $\hat{\mathcal{Y}}_\alpha(X_i, \mathcal{D}_i)$ contain the true class and at most three additional class labels.

Fig 3. Cross-validated typicality indices for tobacco "smells".

Fig 4. 0.99-confidence prediction regions for tobacco "smells".



3.3. Nonparametric p-values via permutation tests 

For a particular class $\theta$ let $I(1) < I(2) < \cdots < I(N_\theta)$ be the elements of $\mathcal{G}_\theta$. An elementary but useful fact is that $(X, X_{I(1)}, X_{I(2)}, \ldots, X_{I(N_\theta)})$ is exchangeable conditional on $Y = \theta$. Thus let $T_\theta(X, \mathcal{D})$ be a test statistic which is symmetric in $(X_{I(j)})_{j=1}^{N_\theta}$. We define $\mathcal{D}_i(x)$ to be the training data with $x$ in place of $X_i$. Then the nonparametric p-value

$$\pi_\theta(X, \mathcal{D}) := \frac{\#\{i \in \mathcal{G}_\theta : T_\theta(X_i, \mathcal{D}_i(X)) \ge T_\theta(X, \mathcal{D})\} + 1}{N_\theta + 1} \tag{3.2}$$

satisfies requirement (1.2). Since $\pi_\theta \ge (N_\theta + 1)^{-1}$, this procedure is useful only if $N_\theta + 1 \ge \alpha^{-1}$. In case of $\alpha = 0.05$ this means that $N_\theta$ should be at least 19.

As for the test statistic $T_\theta(X, \mathcal{D})$, the optimal p-value in Section 2 suggests using an estimator for the weighted likelihood ratio $T^*_\theta(x)$ or a strictly increasing transformation thereof. In very high-dimensional settings this may be too ambitious, and $T_\theta(X, \mathcal{D})$ could be any test statistic quantifying the implausibility of "$Y = \theta$".
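Definition (3.2) translates directly into a short routine. The following sketch assumes scalar features stored in a list and a user-supplied test statistic; the names `nonparametric_p_value` and `distance_to_class_mean`, and the toy data, are assumptions made for illustration. The statistic used here (distance to the class mean) is merely one symmetric choice, not the statistic recommended in the paper.

```python
def nonparametric_p_value(x, theta, features, labels, statistic):
    """Permutation p-value (3.2) for the hypothesis Y = theta at a new point x.

    statistic(point, theta, features, labels) must be symmetric in the training
    features of class theta; large values indicate that theta is implausible.
    """
    group = [i for i, y in enumerate(labels) if y == theta]
    t_new = statistic(x, theta, features, labels)
    count = 0
    for i in group:
        swapped = list(features)
        swapped[i] = x  # D_i(x): training data with x in place of X_i
        if statistic(features[i], theta, swapped, labels) >= t_new:
            count += 1
    return (count + 1) / (len(group) + 1)


def distance_to_class_mean(point, theta, features, labels):
    """Illustrative statistic for scalar features: |point - mean of class theta|."""
    values = [f for f, y in zip(features, labels) if y == theta]
    return abs(point - sum(values) / len(values))


# Toy data: class 1 near 1, class 2 near 10.
features = [0.0, 1.0, 2.0, 10.0, 11.0]
labels = [1, 1, 1, 2, 2]
p_central = nonparametric_p_value(1.0, 1, features, labels, distance_to_class_mean)
p_outlying = nonparametric_p_value(10.0, 1, features, labels, distance_to_class_mean)
```

With only $N_1 = 3$ training points in class 1, the smallest attainable p-value is $1/4$, which illustrates why the group sizes must satisfy $N_\theta + 1 \ge \alpha^{-1}$ before a class can ever be excluded at level $\alpha$.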

Plug-in statistic for standard gaussian model. For the setting of Example 2.1 and Section 3.2 one could replace the unknown parameters $w_c$, $\mu_c$ and $\Sigma$ in $T^*_\theta$ with $N_c/n$, $\hat{\mu}_c$ and $\hat{\Sigma}$, respectively. Note that the resulting p-values always satisfy (1.2), even if the underlying distributions $P_c$ are not gaussian with common covariance matrix.

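Nearest-neighbor estimation. One could estimate $w_\theta(\cdot)$ via nearest neighbors. Suppose that $d(\cdot, \cdot)$ is some metric on $\mathcal{X}$. Let $B(x, r) := \{y \in \mathcal{X} : d(x, y) \le r\}$, and for a fixed positive integer $k \le n$ define

$$\hat{r}_k(x) = \hat{r}_k(x, \mathcal{D}) := \min\bigl\{r \ge 0 : \#\{i \le n : X_i \in B(x, r)\} \ge k\bigr\}.$$

Further let $\hat{P}_\theta$ denote the empirical distribution of the points $X_i$, $i \in \mathcal{G}_\theta$, i.e.

$$\hat{P}_\theta(B) := N_\theta^{-1}\, \#\{i \in \mathcal{G}_\theta : X_i \in B\} \quad\text{for } B \subset \mathcal{X}.$$

Then the $k$-nearest-neighbor estimator of $w_\theta(x)$ is given by

$$\hat{w}_\theta(x, \mathcal{D}) := \frac{\hat{w}_\theta\, \hat{P}_\theta\bigl(B(x, \hat{r}_k(x))\bigr)}{\sum_{b \in \Theta} \hat{w}_b\, \hat{P}_b\bigl(B(x, \hat{r}_k(x))\bigr)}$$

with certain estimators $\hat{w}_b = \hat{w}_b(\mathcal{D})$ of $w_b$. The resulting nonparametric p-value is defined with $T_\theta(x, \mathcal{D}) := -\hat{w}_\theta(x, \mathcal{D})$. Note that in case of $\hat{w}_b = N_b/n$, we simply end up with the ratio

$$\hat{w}_\theta(x, \mathcal{D}) = \#\{i \in \mathcal{G}_\theta : d(X_i, x) \le \hat{r}_k(x)\} \big/ \#\{i \le n : d(X_i, x) \le \hat{r}_k(x)\}.$$

For simplicity, we assume $k$ to be determined by the group sizes $N_1, \ldots, N_L$ only. Of course one could define $\pi_\theta(X, \mathcal{D})$ with $k = k_\theta(X, \mathcal{D})$ nearest neighbors of $X$, as long as $k_\theta(X, \mathcal{D})$ is symmetric in the $N_\theta + 1$ feature vectors $X$ and $X_i$, $i \in \mathcal{G}_\theta$. Moreover, in applications where the different components of $X$ are measured on rather different scales, it might be reasonable to replace $d(\cdot, \cdot)$ with some data-driven metric.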
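For the special case of prior estimates $N_b/n$, the nearest-neighbor estimator is just the share of class-$\theta$ points within the closed ball of radius $\hat{r}_k(x)$. A minimal sketch follows, assuming scalar features and the absolute difference as metric by default; the function name `knn_posterior_weight` and the toy data are assumptions for illustration only.

```python
def knn_posterior_weight(x, theta, features, labels, k, dist=lambda a, b: abs(a - b)):
    """k-nearest-neighbor estimate of w_theta(x) with prior estimates N_b / n.

    Uses the closed ball B(x, r_k(x)), so training points tied at the k-th
    smallest distance are all included.
    """
    if not 1 <= k <= len(features):
        raise ValueError("k must satisfy 1 <= k <= n")
    radius = sorted(dist(f, x) for f in features)[k - 1]  # smallest r with >= k points in B(x, r)
    in_ball = [y for f, y in zip(features, labels) if dist(f, x) <= radius]
    return sum(1 for y in in_ball if y == theta) / len(in_ball)


features = [0.0, 1.0, 2.0, 10.0, 11.0]
labels = [1, 1, 1, 2, 2]
w1_far = knn_posterior_weight(9.0, 1, features, labels, k=3)   # ball holds 2.0, 10.0, 11.0
w1_tied = knn_posterior_weight(1.0, 1, features, labels, k=2)  # tie at distance 1 enlarges the ball
```

Passing the negative of this estimate as the test statistic of a permutation p-value gives the nearest-neighbor p-values described above; because the estimate depends on the training features of class $\theta$ only through a count, it is symmetric in them as required.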
Logistic regression. Suppose for simplicity that there are $L = 2$ classes and that $X \in \mathbb{R}^d$ contains the values of $d$ numerical or binary variables. Let $(\hat{a}, \hat{b}) = (\hat{a}(\mathcal{D}), \hat{b}(\mathcal{D}))$ be the maximum likelihood estimator for the parameter $(a, b) \in \mathbb{R} \times \mathbb{R}^d$ in the logistic model, where

$$\log \frac{w_2(x)}{1 - w_2(x)} = a + b^\top x.$$

Then possible candidates for $T_1(x, \mathcal{D})$ and $T_2(x, \mathcal{D})$ are given by

$$T_1(x, \mathcal{D}) := \hat{a} + \hat{b}^\top x =: -T_2(x, \mathcal{D}).$$

Extensions to multicategory logistic regression as well as the inclusion of regularization terms to deal with high-dimensional covariable vectors $X$ are possible and will be described elsewhere.



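3.4. Asymptotic properties

Now we analyze the asymptotic behavior of the nonparametric p-values $\pi_\theta(X, \mathcal{D})$ and the corresponding empirical probabilities $\hat{\mathcal{I}}_\alpha(b, \theta)$ and $\hat{\mathcal{P}}_\alpha(b, S)$. Throughout this section, asymptotic statements are to be understood within setting (3.1).

As in Section 2 we assume that the distributions $P_\theta$ have strictly positive densities with respect to some measure $M$ on $\mathcal{X}$. The following theorem implies that $\pi_\theta(X, \mathcal{D})$ satisfies (1.3) under certain conditions on the underlying test statistic $T_\theta(X, \mathcal{D})$. In addition the empirical probabilities $\hat{\mathcal{I}}_\alpha(b, \theta)$ and $\hat{\mathcal{P}}_\alpha(b, S)$ turn out to be consistent estimators of $\mathcal{I}_\alpha(b, \theta \mid \mathcal{D})$ and $\mathcal{P}_\alpha(b, S \mid \mathcal{D})$, respectively.

Theorem 3.1. Suppose that for fixed $\theta \in \Theta$ there exists a test statistic $T^o_\theta$ on $\mathcal{X}$ satisfying the following two requirements:

$$T_\theta(X, \mathcal{D}) \to_p T^o_\theta(X), \tag{3.3}$$
$$\mathcal{L}(T^o_\theta(X)) \text{ is continuous.} \tag{3.4}$$

Then

$$\pi_\theta(X, \mathcal{D}) \to_p \pi^o_\theta(X), \tag{3.5}$$

where

$$\pi^o_\theta(x) := P_\theta\{z \in \mathcal{X} : T^o_\theta(z) \ge T^o_\theta(x)\}.$$

In particular, for arbitrary fixed $\alpha \in (0, 1)$,

$$\mathcal{R}_\alpha(\pi_\theta(\cdot, \mathcal{D})) \to_p \mathcal{R}_\alpha(\pi^o_\theta), \tag{3.6}$$
$$\left.\begin{array}{c} \mathcal{I}_\alpha(b, \theta \mid \mathcal{D}) \\ \hat{\mathcal{I}}_\alpha(b, \theta) \end{array}\right\} \to_p \mathbb{P}(\pi^o_\theta(X) > \alpha \mid Y = b) \quad\text{for each } b \in \Theta. \tag{3.7}$$

If the limiting test statistic $T^o_\theta$ is equal to $T^*_\theta$ or some strictly increasing transformation thereof, then the nonparametric p-value $\pi_\theta(\cdot, \mathcal{D})$ is asymptotically optimal. The next two lemmata describe situations in which Condition (3.3) or (3.4) is satisfied.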

Lemma 3.2. Conditions (3.3) and (3.4) are satisfied in case of the plug-in rule for the homoscedastic gaussian model, provided that $\mathbb{E}(\|X\|^2) < \infty$ and $\mathcal{L}(X)$ has a Lebesgue density.

Lemma 3.3. Suppose that $(\mathcal{X}, d)$ is a separable metric space and that all densities $f_b$, $b \in \Theta$, are continuous on $\mathcal{X}$. Alternatively, suppose that $\mathcal{X} = \mathbb{R}^q$ equipped with some norm. Then Condition (3.3) is satisfied with $T^o_\theta = T^*_\theta$ in case of the $k$-nearest-neighbor rule with $\hat{w}_\theta = N_\theta/n$, provided that

$$k = k(n) \to \infty \quad\text{and}\quad k/n \to 0.$$

3.5. Examples

The nonparametric p-values are illustrated with two examples.

Example 3.2. The lower right panel in Figure 1 shows simulated training data from the model in Example 2.2, where $N_1 = N_2 = N_3 = 100$. Now we computed the corresponding prediction regions $\hat{\mathcal{Y}}_{0.05}(x, \mathcal{D})$ based on the plug-in method for the standard gaussian model (which isn't correct here) and on the nearest-neighbor method with $k = 100$ and standard Euclidean distance. Figure 5 depicts these prediction regions.

To judge the performance of the nonparametric p-values visually we chose ROC curves, where we concentrated on the plug-in method. In Figure 6 we show for each pair $(b, \theta) \in \Theta \times \Theta$ the true ROC curves of $\pi^*_\theta(\cdot)$ and $\pi_\theta(\cdot, \mathcal{D})$,

$$\alpha \mapsto \mathbb{P}(\pi^*_\theta(X) \le \alpha \mid Y = b) \quad\text{(magenta)},$$
$$\alpha \mapsto 1 - \mathcal{I}_\alpha(b, \theta \mid \mathcal{D}) \quad\text{(blue)},$$

both of which had been estimated in 40'000 Monte Carlo simulations of $X \sim P_\theta$. In addition we show the empirical ROC curve $\alpha \mapsto 1 - \hat{\mathcal{I}}_\alpha(b, \theta)$ (black step function). Note first that the difference between the (conditional) ROC curve $1 - \mathcal{I}_\alpha(b, \theta \mid \mathcal{D})$ of $\pi_\theta(\cdot, \mathcal{D})$ and its empirical counterpart is always rather small, despite the moderate group sizes $N_b = 100$. Note further that the ROC curves of $\pi_\theta(\cdot, \mathcal{D})$ and $\pi^*_\theta(\cdot)$ are also close together, despite the fact that the plug-in method uses an incorrect model. These pictures show clearly that distinguishing between classes 1 and 2 is more difficult than distinguishing between classes 2 and 3, while classes 1 and 3 are separated almost perfectly.

Of course these pictures give only partial information about the performance of the p-values. In addition one could investigate the joint distribution of the p-values via pattern probabilities; see also the next example.

Fig 6. ROC curves for the plug-in method applied to the data in Example 3.2 (one panel for each pair of $b = 1, 2, 3$ and $\theta = 1, 2, 3$).
Table 1
Empirical performance of $\hat{\mathcal{Y}}_{0.05}(\cdot, \cdot)$ and $\hat{\mathcal{Y}}_{0.01}(\cdot, \cdot)$ in Example 3.3.

        Y_0.05(X_i, D_i)                            Y_0.01(X_i, D_i)
  b     ∋ 1     ∋ 2     = {1}   = {2}   = {1,2}     ∋ 1     ∋ 2     = {1}   = {2}   = {1,2}
  1     .950    .244    .756    .050    .194        .990    .448    .552    .010    .438
        .950    .222    .778    .050    .172        .990    .452    .548    .010    .443
        .952    .233    .767    .048    .185        .990    .449    .551    .010    .440
  2     .396    .950    .050    .604    .346        .743    .991    .009    .257    .734
        .356    .950    .050    .644    .307        .698    .991    .009    .302    .689
        .406    .950    .050    .594    .356        .773    .992    .008    .227    .766


Example 3.3. This example is from a data base on quality management at the University hospital at Lübeck. In a long-term study on mortality of patients after a certain type of heart surgery, data of more than 20'000 cases have been reported. The dependent variable is $Y \in \{1, 2\}$ with $Y = 1$ and $Y = 2$ meaning that the patient survived the operation or not, respectively. For each case there were $q = 21$ numerical or binary covariables describing the patient (e.g. sex, age, various specific risk factors) plus covariables describing the circumstances of the operation (e.g. emergency or not, experience of the surgeon).

We reduced the data set by taking all $N_2 = 662$ observations with $Y = 2$ and a random subsample of $N_1 = 3N_2 = 1986$ observations with $Y = 1$. Without such a reduction, the nearest-neighbor method wouldn't work well due to the very different group sizes. Now we computed nonparametric cross-validated p-values based on the plug-in method from the standard gaussian model, logistic regression, and the nearest-neighbor method with $k = 200$. In the latter case, we first divided each component of $X$ corresponding to a non-dichotomous variable by its sample standard deviation, because the variables are measured on very different scales. Table 1 reports the performance of $\hat{\mathcal{Y}}_\alpha(X_i, \mathcal{D}_i)$ as a predictor of $Y_i$ for $\alpha = 5\%$ and $\alpha = 1\%$. In each cell of the table the entries correspond to the three methods mentioned above. This example shows the p-values' potential to classify a certain fraction of cases unambiguously even in situations in which overall risks of classifiers are not small, which is rather typical in medical applications. Note again that the method doesn't require any knowledge of prior probabilities. Logistic regression yielded slightly better results than the other two in terms of the fraction of cases with $\hat{\mathcal{Y}}_\alpha(X_i, \mathcal{D}_i) = \{Y_i\}$. The other two methods performed similarly.

3.6. Impossibility of strengthening (1.3) 

Comparing (1.2) and (1.3), one might want to strengthen the latter requirement to

$$\mathbb{P}(\pi_\theta(X, \mathcal{D}) \le \alpha \mid Y = \theta, \mathcal{D}) \le \alpha \quad\text{almost surely.} \tag{3.8}$$

However, the following lemma entails that there are no reasonable p-values satisfying (3.8). Recall that we are aiming at p-values such that $\mathbb{P}(\pi_\theta(X, \mathcal{D}) \le \alpha \mid Y = b)$ is large for $b \ne \theta$.

Lemma 3.4. Let $Q_1, Q_2, \ldots, Q_L$ be mutually absolutely continuous probability distributions on $\mathcal{X}$. Suppose that (3.8) is satisfied whenever $(P_1, P_2, \ldots, P_L)$ is a permutation of $(Q_1, Q_2, \ldots, Q_L)$. In that case, for arbitrary $b \in \Theta$,

$$\mathbb{P}(\pi_\theta(X, \mathcal{D}) \le \alpha \mid Y = b, \mathcal{D}) \le \alpha \quad\text{almost surely.}$$



4. Computational aspects 



The computation of the p-values in (3.2) may be rather time-consuming,
depending on the particular test statistic $T_\theta(\cdot,\mathcal{D})$. Just
think about classification methods involving variable selection or tuning of
artificial neural networks by means of $\mathcal{D}$. Also the nearest-neighbor
method with some data-driven choice of $k$ or the metric may result in tedious
procedures. In order to compute $\pi_\theta(\cdot,\mathcal{D})$ as well as
$\pi_\theta(X_i,\mathcal{D}_i)$ one can typically reduce the computational
complexity considerably by using suitable update formulae or shortcuts.

Naive shortcuts for the nonparametric p-values. One might be tempted
to replace $\pi_\theta(X,\mathcal{D})$ with the naive p-values

$$\pi_\theta^{\rm naive}(X,\mathcal{D}) := \frac{\#\{i \in \mathcal{G}_\theta : T_\theta(X_i,\mathcal{D}) \ge T_\theta(X,\mathcal{D})\} + 1}{N_\theta + 1}.$$

One can easily show that the conclusions of Theorem 3.1 remain true with
$\pi_\theta^{\rm naive}(\cdot,\cdot)$ in place of $\pi_\theta(\cdot,\cdot)$.
However, finite sample validity in the sense of (1.2) is not satisfied in
general, so we prefer the alternative shortcut described next. Note also that
empirical ROC curves, offered by some statistical software packages as a
complement to logistic regression or linear discriminant analysis with two
classes, are often based on this shortcut.

Valid shortcuts for the nonparametric p-values. Often the computations as well
as the program code become much simpler if we replace $T_\theta(X,\mathcal{D})$
and $T_\theta(X_i,\mathcal{D}_i(X))$ in Definition (3.2) with
$T_\theta(X,\mathcal{D}(X,\theta))$ and $T_\theta(X_i,\mathcal{D}(X,\theta))$,
respectively, where $\mathcal{D}(X,\theta)$ denotes the training data
$\mathcal{D}$ after adding the "observation" $(X,\theta)$. That means, before
judging whether $\theta$ is a plausible class label for a new observation $X$,
we augment the training data by $(X,\theta)$ to determine the test statistic
$T_\theta(\cdot,\mathcal{D}(X,\theta))$. Then we just evaluate the latter
function at the $N_\theta + 1$ points $X$ and $X_i$, $i \in \mathcal{G}_\theta$,
to compute

$$\pi_\theta^{\rm naive}(X,\mathcal{D}(X,\theta)) = \frac{\#\{i \in \mathcal{G}_\theta : T_\theta(X_i,\mathcal{D}(X,\theta)) \ge T_\theta(X,\mathcal{D}(X,\theta))\} + 1}{N_\theta + 1}.$$

This p-value does satisfy Condition (1.2), and the conclusions of Theorem 3.1
remain true as well. In this context it might be helpful if the underlying test
statistics satisfy some moderate robustness properties, because $X$ may be an
outlier with respect to the distribution $P_\theta$.
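
The shortcut just described can be sketched in a few lines of Python. The sketch is illustrative only: the test statistic used here (squared Euclidean distance to the mean of class $\theta$, with large values speaking against $\theta$) is a hypothetical stand-in for whatever statistic $T_\theta$ is actually employed, and the function names are assumptions.

```python
import numpy as np

def distance_stat(x, X_train, y_train, theta):
    """Hypothetical test statistic (an assumption, not the paper's choice):
    squared distance of x to the mean of class theta in the given data."""
    mu = X_train[y_train == theta].mean(axis=0)
    return float(np.sum((x - mu) ** 2))

def shortcut_p_value(x, X_train, y_train, theta, stat=distance_stat):
    """p-value for the hypothesis 'label of x is theta' via the augmented data."""
    # Augment the training data with the pseudo-observation (x, theta).
    X_aug = np.vstack([X_train, x[None, :]])
    y_aug = np.append(y_train, theta)
    # Evaluate the statistic, trained on the augmented data, at x and
    # at all training points of class theta.
    t_new = stat(x, X_aug, y_aug, theta)
    group = X_train[y_train == theta]
    t_group = np.array([stat(xi, X_aug, y_aug, theta) for xi in group])
    return (np.sum(t_group >= t_new) + 1) / (len(group) + 1)

X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array([1, 1, 1, 2, 2, 2])
x_new = np.array([1.0])
print(shortcut_p_value(x_new, X_train, y_train, 1))  # class 1 is plausible
print(shortcut_p_value(x_new, X_train, y_train, 2))  # smallest possible value 1/(N+1)
```

Because the statistic is re-trained once per candidate label rather than once per training observation, the cost per new feature vector is $L$ model fits instead of order $n$.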
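Update formulae for sample means and covariances. In connection with
the typicality indices of Section 3.2 or the plug-in method for the standard
gaussian model, elementary calculations reveal the following update formulae
for groupwise mean vectors and sample covariance matrices: Replacing
$\mathcal{D}$ with the reduced data set $\mathcal{D}_i$ for some
$i \in \mathcal{G}_\theta$ has no impact on $\hat\mu_b$ for $b \ne \theta$, while

$$\hat\Sigma \leftarrow (n - L - 1)^{-1}\Bigl((n - L)\hat\Sigma - (1 - N_\theta^{-1})^{-1}(X_i - \hat\mu_\theta)(X_i - \hat\mu_\theta)^\top\Bigr),$$

$$\hat\mu_\theta \leftarrow (N_\theta - 1)^{-1}(N_\theta\hat\mu_\theta - X_i) = \hat\mu_\theta - (N_\theta - 1)^{-1}(X_i - \hat\mu_\theta).$$

Replacing $\mathcal{D}$ with the modified data set $\mathcal{D}_i(X)$ for some
$i \in \mathcal{G}_\theta$ results in

$$\hat\Sigma \leftarrow (n - L)^{-1}\Bigl[(n - L)\hat\Sigma + (1 - N_\theta^{-1})\Bigl((X - \hat\mu_{\theta,i})(X - \hat\mu_{\theta,i})^\top - (X_i - \hat\mu_{\theta,i})(X_i - \hat\mu_{\theta,i})^\top\Bigr)\Bigr],$$

$$\hat\mu_\theta \leftarrow \hat\mu_\theta + N_\theta^{-1}(X - X_i),$$

where $\hat\mu_{\theta,i} := (N_\theta - 1)^{-1}(N_\theta\hat\mu_\theta - X_i)$.
Finally, replacing $\mathcal{D}$ with the augmented data set
$\mathcal{D}(X,\theta)$ means that

$$\hat\Sigma \leftarrow (n + 1 - L)^{-1}\Bigl((n - L)\hat\Sigma + (1 + N_\theta^{-1})^{-1}(X - \hat\mu_\theta)(X - \hat\mu_\theta)^\top\Bigr),$$

$$\hat\mu_\theta \leftarrow (N_\theta + 1)^{-1}(N_\theta\hat\mu_\theta + X) = \hat\mu_\theta + (N_\theta + 1)^{-1}(X - \hat\mu_\theta).$$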
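
The update for the augmented data set can be checked numerically. The following Python sketch assumes that the covariance estimator is the pooled within-group covariance with divisor $n - L$ (as the formulae above suggest) and compares the rank-one update with a recomputation from scratch; the function names are hypothetical.

```python
import numpy as np

def pooled_stats(X, y, labels):
    """Groupwise means and pooled covariance with divisor n - L (assumed form)."""
    n, L = len(y), len(labels)
    means = {b: X[y == b].mean(axis=0) for b in labels}
    scatter = sum((X[y == b] - means[b]).T @ (X[y == b] - means[b]) for b in labels)
    return means, scatter / (n - L)

def augment_update(means, cov, counts, n, L, x, theta):
    """Mean of class theta and pooled covariance after adding (x, theta)."""
    N = counts[theta]
    diff = x - means[theta]
    # Rank-one correction with weight (1 + 1/N)^(-1) = N / (N + 1).
    new_cov = ((n - L) * cov + np.outer(diff, diff) / (1.0 + 1.0 / N)) / (n + 1 - L)
    new_mean = means[theta] + diff / (N + 1)
    return new_mean, new_cov

rng = np.random.default_rng(1)
labels = [1, 2, 3]
y = np.repeat(labels, [10, 14, 16])
X = rng.normal(size=(40, 3))
x = rng.normal(size=3)

means, cov = pooled_stats(X, y, labels)
counts = {b: int(np.sum(y == b)) for b in labels}
new_mean, new_cov = augment_update(means, cov, counts, 40, 3, x, theta=2)

# Reference: recompute everything on the augmented data set.
ref_means, ref_cov = pooled_stats(np.vstack([X, x]), np.append(y, 2), labels)
print(np.allclose(new_mean, ref_means[2]), np.allclose(new_cov, ref_cov))
```

The update costs on the order of $q^2$ operations per new observation, whereas the recomputation grows with $n q^2$.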

Update formulae for the nearest-neighbor method. For convenience
we restrict our attention to the valid shortcut involving
$\mathcal{D}(X,\theta)$. To compute the resulting p-values
$\pi_\theta^{\rm naive}(X,\mathcal{D}(X,\theta))$ quickly for arbitrary feature
vectors $X \in \mathcal{X}$, it is convenient to store the $n(1 + 2L)$ numbers

$$\hat r_k(X_i,\mathcal{D}), \quad N_{k-1,b}(X_i,\mathcal{D}), \quad N_{k,b}(X_i,\mathcal{D})$$

with $i \in \{1, \dots, n\}$ and $b \in \Theta$, where

$$N_{\ell,b}(x,\mathcal{D}) := \#\bigl\{i \in \{1,\dots,n\} : Y_i = b,\ d(x,X_i) \le \hat r_\ell(x,\mathcal{D})\bigr\}.$$

For then one can easily verify that

$$N_{k,b}(X_i,\mathcal{D}(X,\theta)) = \begin{cases}
N_{k-1,b}(X_i,\mathcal{D}) + 1\{b = \theta\} & \text{if } d(X_i,X) < \hat r_k(X_i,\mathcal{D}), \\
N_{k,b}(X_i,\mathcal{D}) + 1\{b = \theta\} & \text{if } d(X_i,X) = \hat r_k(X_i,\mathcal{D}), \\
N_{k,b}(X_i,\mathcal{D}) & \text{if } d(X_i,X) > \hat r_k(X_i,\mathcal{D}).
\end{cases}$$

Hence classifying a new feature vector $X$ requires only $O(n)$ steps for
determining the $1 + L^2$ numbers $\hat r_k(X,\mathcal{D}(X,\theta))$ and
$N_b(X,\mathcal{D}(X,\theta))$ and the $nL^2$ numbers
$N_b(X_i,\mathcal{D}(X,\theta))$, where $1 \le i \le n$ and $b, \theta \in \Theta$.

Computing the crossvalidated p-values with the valid shortcut is particularly
easy, because replacing one training observation $(X_i,Y_i)$ with
$(X_i,\theta)$ does not affect the radii $\hat r_k(x,\mathcal{D})$.

In case of a data-driven choice of $k$ or the metric, the preceding formulae
are no longer applicable. Then the valid shortcut is particularly useful to
reduce the computational complexity.
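
For fixed $k$, the update rule for the neighbor counts can be checked against a brute-force recount. The Python sketch below assumes Euclidean distance, takes the radius to be the $k$-th smallest distance from a point to the training points (the point itself included), and counts points within closed balls; all function names are hypothetical.

```python
import numpy as np

def dist(A, x):
    """Euclidean distances from the rows of A to x (metric is an assumption)."""
    return np.sqrt(((A - x) ** 2).sum(axis=-1))

def radius_and_counts(x, X, y, ell, labels):
    """ell-th smallest distance from x and per-class counts within that radius."""
    d = dist(X, x)
    r = np.sort(d)[ell - 1]
    counts = np.array([np.sum((y == b) & (d <= r)) for b in labels])
    return r, counts

def updated_counts(x_i, x_new, theta, X, y, k, labels):
    """Counts N_{k,b}(x_i, D(x_new, theta)) from stored quantities for D."""
    r_k, n_k = radius_and_counts(x_i, X, y, k, labels)       # stored in practice
    _, n_km1 = radius_and_counts(x_i, X, y, k - 1, labels)   # stored in practice
    bonus = (labels == theta).astype(int)                     # 1{b = theta}
    d_new = dist(x_new[None, :], x_i)[0]
    if d_new < r_k:
        return n_km1 + bonus
    if d_new == r_k:
        return n_k + bonus
    return n_k

rng = np.random.default_rng(0)
labels = np.array([1, 2, 3])
X = rng.normal(size=(30, 2))
y = rng.integers(1, 4, size=30)
x_new = rng.normal(size=2)

agree = True
for theta in labels:
    X_aug, y_aug = np.vstack([X, x_new]), np.append(y, theta)
    for i in range(len(X)):
        _, brute = radius_and_counts(X[i], X_aug, y_aug, 5, labels)
        agree &= np.array_equal(brute, updated_counts(X[i], x_new, theta, X, y, 5, labels))
print(agree)
```

In an actual implementation the radii and counts for $\mathcal{D}$ would be computed once and stored, so that each update amounts to a single distance evaluation and comparison per training point.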
5. Likelihood ratios and local identifiability

In previous sections we assumed that the distribution of likelihood ratios such
as $w_\theta(X)$ or $T_\theta(X)$ is continuous. This property is related to a
property which we call 'local identifiability', a strengthening of the
well-known notion of identifiability for finite mixtures. Throughout this
section we assume that the distributions $P_1, P_2, \dots, P_L$ belong to a
given model $(Q_\xi)_{\xi\in\Xi}$ of probability distributions with densities
$g_\xi > 0$ with respect to some measure $M$ on $\mathcal{X}$.

Identifiability. Let us first recall Yakowitz and Spragins' [13] definition of
identifiability for finite mixtures. The family $(Q_\xi)_{\xi\in\Xi}$ is called
identifiable, if the following condition is satisfied: For arbitrary
$m \in \mathbb{N}$ let $\xi(1), \dots, \xi(m)$ be pairwise different parameters
in $\Xi$ and let $\lambda_1, \dots, \lambda_m > 0$. If
$\xi'(1), \dots, \xi'(m) \in \Xi$ and $\lambda_1', \dots, \lambda_m' > 0$ such
that

$$\sum_{j=1}^m \lambda_j Q_{\xi(j)} = \sum_{j=1}^m \lambda_j' Q_{\xi'(j)},$$

then there exists a permutation $\sigma$ of $\{1, 2, \dots, m\}$ such that
$\xi'(j) = \xi(\sigma(j))$ and $\lambda_j' = \lambda_{\sigma(j)}$ for
$j = 1, 2, \dots, m$.

Evidently the family $(Q_\xi)_{\xi\in\Xi}$ is identifiable if the density
functions $g_\xi$, $\xi \in \Xi$, are linearly independent as elements of
$L^1(M)$, and the converse statement is also true [13].

A standard example of an identifiable family is the set of all nondegenerate
gaussian distributions on $\mathbb{R}^q$; see [13]. Holzmann et al. [6] provide
a rather comprehensive list of identifiable classes of multivariate
distributions. In particular, they verify identifiability of families of
elliptically symmetric distributions on $\mathcal{X} = \mathbb{R}^q$ with
Lebesgue densities of the form

$$g_\xi(x) = \det(\Sigma)^{-1/2}\, h_q\bigl((x - \mu)^\top \Sigma^{-1} (x - \mu); \zeta\bigr). \tag{5.1}$$

Here the parameter $\xi = (\mu, \Sigma, \zeta)$ consists of an arbitrary
location parameter $\mu \in \mathbb{R}^q$, an arbitrary symmetric and positive
definite scatter matrix $\Sigma \in \mathbb{R}^{q\times q}$ and an additional
shape parameter $\zeta$ which may also vary in the mixture. For each shape
parameter $\zeta$, the 'density generator' $h_q(\cdot;\zeta)$ is a nonnegative
function on $[0,\infty)$ such that
$\int_{\mathbb{R}^q} h_q(\|x\|^2;\zeta)\,dx = 1$. One particular example is
given by the multivariate $t$-distributions with

$$h_q(u;\zeta) = \frac{\Gamma((\zeta + q)/2)}{(\pi\zeta)^{q/2}\,\Gamma(\zeta/2)} \Bigl(1 + \frac{u}{\zeta}\Bigr)^{-(\zeta + q)/2}$$

for $\zeta > 0$. We mention that the subsequent arguments apply to most of the
elliptically symmetric families discussed by Holzmann et al. [6]. Peel et
al. [9] discuss classification for directional data, and our method can be
extended to distributions with non-euclidean domain, combining the arguments
below with methods in Holzmann et al. [5]. As prominent examples we mention the
von Mises family for directional data and the Kent family for spherical data.
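
As a quick plausibility check of the normalization of the density generator of the multivariate $t$-distributions, the following Python sketch integrates $h_q(\|x\|^2;\zeta)$ over $\mathbb{R}^q$ numerically in polar coordinates. It assumes the standard form of the multivariate $t$ density with $\zeta$ degrees of freedom; the function names, the truncation radius and the grid size are arbitrary choices made for illustration.

```python
import math
import numpy as np

def t_density_generator(u, zeta, q):
    """Density generator of the q-variate t-distribution (standard form, assumed)."""
    const = math.gamma((zeta + q) / 2) / (
        (math.pi * zeta) ** (q / 2) * math.gamma(zeta / 2)
    )
    return const * (1.0 + u / zeta) ** (-(zeta + q) / 2)

def total_mass(zeta, q, r_max=200.0, num=400_001):
    """Approximate the integral of h_q(|x|^2; zeta) over R^q via the radial
    integral  area(S^{q-1}) * int_0^r_max r^(q-1) h_q(r^2; zeta) dr."""
    r = np.linspace(0.0, r_max, num)
    sphere_area = 2 * math.pi ** (q / 2) / math.gamma(q / 2)
    f = sphere_area * r ** (q - 1) * t_density_generator(r ** 2, zeta, q)
    return float(np.sum((f[1:] + f[:-1]) / 2 * np.diff(r)))  # trapezoidal rule

for q in (1, 2, 3):
    print(q, round(total_mass(3.0, q), 4))
```

Up to truncation and discretization error, each integral should be close to one.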
Continuity of likelihood ratios. Suppose that $P_\theta = Q_{\xi(\theta)}$ with
parameters $\xi(1), \dots, \xi(L)$ in $\Xi$ which are not all identical. Then
one can easily verify that continuity of $\mathcal{L}(w_\theta(X))$ or
$\mathcal{L}(T_\theta(X))$ follows from the following condition: The family
$(Q_\xi)_{\xi\in\Xi}$ is called locally identifiable, if for arbitrary
$m \in \mathbb{N}$, pairwise different parameters
$\xi(1), \dots, \xi(m) \in \Xi$ and numbers
$\beta_1, \dots, \beta_m \in \mathbb{R}$,

$$M\Bigl\{x \in \mathcal{X} : \sum_{j=1}^m \beta_j g_{\xi(j)}(x) = 0\Bigr\} > 0 \quad\text{implies that}\quad \beta_1 = \beta_2 = \cdots = \beta_m = 0.$$

Local identifiability entails the following conclusion: Suppose that $Q$ is
equal to $\sum_{j=1}^m \lambda_j Q_{\xi(j)}$ for some number $m \in \mathbb{N}$,
pairwise different parameters $\xi(1), \dots, \xi(m)$ in $\Xi$ and nonnegative
numbers $\lambda_1, \dots, \lambda_m$. Then one can determine the ingredients
$m$, $\xi(1), \dots, \xi(m)$ and $\lambda_1, \dots, \lambda_m$ from the
restriction of $Q$ to any fixed measurable set $B_o \subset \mathcal{X}$ with
$M(B_o) > 0$. The following theorem provides a sufficient criterion for local
identifiability which is easily verified in many standard examples.

Theorem 5.1. Let $M$ be Lebesgue measure on
$\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_q$
with open intervals $\mathcal{X}_k \subset \mathbb{R}$. Suppose that the
following two conditions are satisfied:

(i) $(Q_\xi)_{\xi\in\Xi}$ is identifiable;

(ii) for arbitrary $\xi \in \Xi$, $k \in \{1, 2, \dots, q\}$ and
$x_i \in \mathcal{X}_i$, $i \ne k$, the function

$$t \mapsto g_\xi(x_1, \dots, x_{k-1}, t, x_{k+1}, \dots, x_q)$$

may be extended to a holomorphic function on some open subset of $\mathbb{C}$
containing $\mathcal{X}_k$.

Then the family $(Q_\xi)_{\xi\in\Xi}$ is locally identifiable.

One can easily verify that Condition (ii) of Theorem 5.1 is satisfied by the
densities $g_\xi$ in (5.1), if the density generators $h_q(\cdot;\zeta)$ may be
extended to holomorphic functions on some open subset of $\mathbb{C}$
containing $[0,\infty)$. Hence, for instance, the family of all multivariate
$t$-distributions is locally identifiable.



6. Proofs 

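Proof of Theorem 3.1. Since the distributions $P_1, \dots, P_L$ are mutually
absolutely continuous, Condition (3.3) entails that

$$\rho(\epsilon, N_1, \dots, N_L) := \max_{a,b\in\Theta;\ i=1,\dots,n} \int \mathbb{P}\bigl(|T_\theta(x,\mathcal{D}_i(z)) - T_\theta^o(x)| \ge \epsilon\bigr)\, P_a(dx)\, P_b(dz)$$

tends to zero for any fixed $\epsilon > 0$.

It follows from the elementary inequality

$$\bigl|1\{r \ge s\} - 1\{r_o \ge s_o\}\bigr| \le 1\{|r - r_o| \ge \epsilon\} + 1\{|s - s_o| \ge \epsilon\} + 1\{|r_o - s_o| < 2\epsilon\}$$

for real numbers $r, r_o, s, s_o$ that

$$\begin{aligned}
\pi_\theta(X,\mathcal{D}) &= (N_\theta + 1)^{-1}\Bigl(1 + \sum_{i\in\mathcal{G}_\theta} 1\{T_\theta(X_i,\mathcal{D}_i(X)) \ge T_\theta(X,\mathcal{D})\}\Bigr) \\
&= N_\theta^{-1} \sum_{i\in\mathcal{G}_\theta} 1\{T_\theta(X_i,\mathcal{D}_i(X)) \ge T_\theta(X,\mathcal{D})\} + R_1 \\
&= N_\theta^{-1} \sum_{i\in\mathcal{G}_\theta} 1\{T_\theta^o(X_i) \ge T_\theta^o(X)\} + R_1 + R_2(\epsilon),
\end{aligned}$$

where $|R_1| \le (N_\theta + 1)^{-1}$ and

$$\begin{aligned}
|R_2(\epsilon)| \le{}& N_\theta^{-1}\,\#\{i \in \mathcal{G}_\theta : |T_\theta(X_i,\mathcal{D}_i(X)) - T_\theta^o(X_i)| \ge \epsilon\} \\
&+ 1\{|T_\theta(X,\mathcal{D}) - T_\theta^o(X)| \ge \epsilon\} \\
&+ N_\theta^{-1}\,\#\{i \in \mathcal{G}_\theta : |T_\theta^o(X_i) - T_\theta^o(X)| < 2\epsilon\}.
\end{aligned}$$

Hence $\mathbb{E}|R_2(\epsilon)| \le 2\rho(\epsilon, N_1, \dots, N_L) + \omega(2\epsilon) \to \omega(2\epsilon)$, where

$$\omega(\delta) := \sup_{r} P_\theta\{z \in \mathcal{X} : |T_\theta^o(z) - r| < \delta\} \downarrow 0 \quad (\delta \downarrow 0)$$

by virtue of Condition (3.4). These considerations show that

$$\pi_\theta(X,\mathcal{D}) = \hat F_\theta(T_\theta^o(X)) + o_p(1) = F_\theta(T_\theta^o(X)) + o_p(1),$$

where

$$\begin{aligned}
\hat F_\theta(r) &:= \hat P_\theta\{z \in \mathcal{X} : T_\theta^o(z) \ge r\} = N_\theta^{-1}\,\#\{i \in \mathcal{G}_\theta : T_\theta^o(X_i) \ge r\}, \\
F_\theta(r) &:= P_\theta\{z \in \mathcal{X} : T_\theta^o(z) \ge r\}.
\end{aligned}$$

Here we utilized the well-known fact [11] that
$\|\hat F_\theta - F_\theta\|_\infty = o_p(1)$. Since
$\pi_\theta^o(X) = F_\theta(T_\theta^o(X))$, this entails Conclusion (3.5).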

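As to the remaining assertions (3.6-3.7), note first that (3.5) implies that

$$\tau(\epsilon, N_1, \dots, N_L) := \max_{a,b\in\Theta;\ i=1,\dots,n} \int \mathbb{P}\bigl(|\pi_\theta(x,\mathcal{D}_i(z)) - \pi_\theta^o(x)| \ge \epsilon\bigr)\, P_a(dx)\, P_b(dz)$$

tends to zero for any fixed $\epsilon > 0$, again a consequence of mutual
absolute continuity of $P_1, \dots, P_L$. Similarly as in the proof of (3.5)
one can verify that

$$\begin{aligned}
\mathcal{I}_\alpha(b,\theta \mid \mathcal{D}) &= \mathbb{P}\bigl(\pi_\theta(X,\mathcal{D}) > \alpha \,\big|\, Y = b, \mathcal{D}\bigr) = G_{b,\theta}(\alpha) + R(\epsilon), \\
\hat{\mathcal{I}}_\alpha(b,\theta) &= N_b^{-1} \sum_{i\in\mathcal{G}_b} 1\{\pi_\theta(X_i,\mathcal{D}_i) > \alpha\} = \hat G_{b,\theta}(\alpha) + \hat R(\epsilon) \\
&= G_{b,\theta}(\alpha) + \hat R(\epsilon) + o_p(1),
\end{aligned}$$

with $\hat G_{b,\theta}(u) := \hat P_b\{z \in \mathcal{X} : \pi_\theta^o(z) > u\}$
and $G_{b,\theta}(u) := P_b\{z \in \mathcal{X} : \pi_\theta^o(z) > u\}$, while

$$\begin{aligned}
\mathbb{E}|R(\epsilon)| &\le \tau(\epsilon, N_1, \dots, N_L) + \mathbb{P}\bigl(|\pi_\theta^o(X) - \alpha| < \epsilon \,\big|\, Y = b\bigr) \\
&\to \mathbb{P}\bigl(|\pi_\theta^o(X) - \alpha| < \epsilon \,\big|\, Y = b\bigr), \\
\mathbb{E}|\hat R(\epsilon)| &\le \tau(\epsilon, N_1, \dots, N_{b-1}, N_b - 1, N_{b+1}, \dots, N_L) + \mathbb{P}\bigl(|\pi_\theta^o(X) - \alpha| < \epsilon \,\big|\, Y = b\bigr) \\
&\to \mathbb{P}\bigl(|\pi_\theta^o(X) - \alpha| < \epsilon \,\big|\, Y = b\bigr).
\end{aligned}$$

Since the latter probability tends to zero as $\epsilon \downarrow 0$, we
obtain Claim (3.7). This implies Claim (3.6), because

$$\sum_{b\in\Theta} w_b\, \mathcal{I}_\alpha(b,\theta \mid \mathcal{D}) \to_p \sum_{b\in\Theta} w_b\, \mathbb{P}\bigl(\pi_\theta^o(X) > \alpha \,\big|\, Y = b\bigr) = \mathcal{R}_\alpha(\pi_\theta^o). \qquad \Box$$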

Proof of Lemma 3.2. It is a simple consequence of the weak law of large
numbers that $\hat\mu_b \to_p \mu_b := \mathbb{E}(X \mid Y = b)$ and
$\hat\Sigma \to_p \Sigma := \sum_{b=1}^L w_b \operatorname{Var}(X \mid Y = b)$.
Now one can easily show that (3.3) is satisfied with $T_\theta$ defined as in
(2.1). The results from Section 5 entail that
$\mathrm{Leb}_q\{x \in \mathbb{R}^q : T_\theta(x) = c\} = 0$ for any $c > 0$,
so that (3.4) is satisfied as well. □

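Proof of Lemma 3.3. The assumptions imply the existence of a Borel set
$\mathcal{X}_o \subset \mathcal{X}$ with $\mathbb{P}(X \in \mathcal{X}_o) = 1$
such that the following additional requirements are satisfied:

$$\mathbb{P}(X \in B(x,r)) > 0 \quad\text{for all } x \in \mathcal{X}_o,\ r > 0, \tag{6.1}$$

$$\lim_{r\downarrow 0} \frac{P_b(B(x,r))}{P_\theta(B(x,r))} = \frac{f_b(x)}{f_\theta(x)} \quad\text{for all } \theta, b \in \Theta,\ x \in \mathcal{X}_o. \tag{6.2}$$

In case of continuous densities $f_1, f_2, \dots, f_L > 0$ on a separable
metric space $(\mathcal{X}, d)$, this is easily verified with $\mathcal{X}_o$
being the support of $\mathcal{L}(X)$, i.e. the smallest closed set such that
$\mathbb{P}(X \in \mathcal{X}_o) = 1$. In case of $\mathcal{X} = \mathbb{R}^q$
and $d(x,y) = \|x - y\|$, existence of such a set $\mathcal{X}_o$ is a known
result from geometric measure theory; cf. Federer [2, Theorem 2.9.8].

In view of (6.1-6.2), it suffices to show that for arbitrary fixed
$x \in \mathcal{X}_o$ and $b \in \Theta$,

$$\hat r_{k(n)}(x) \to_p 0 \quad\text{and}\quad \frac{\hat P_b\bigl(B(x,\hat r_{k(n)}(x))\bigr)}{P_b\bigl(B(x,\hat r_{k(n)}(x))\bigr)} \to_p 1. \tag{6.3}$$

To this end, note first that the random numbers
$N(x,r) := \#\{i : d(X_i,x) < r\}$ satisfy

$$\mathbb{E}N(x,r) = \sum_{\theta\in\Theta} N_\theta P_\theta\{z : d(z,x) < r\} = n\bigl(\mathbb{P}(d(X,x) < r) + o(1)\bigr) \quad\text{uniformly in } r \ge 0, \tag{6.4}$$

$$\operatorname{Var}(N(x,r)) = \sum_{\theta\in\Theta} N_\theta P_\theta\{z : d(z,x) < r\}\bigl(1 - P_\theta\{z : d(z,x) < r\}\bigr) \le \min\{\mathbb{E}N(x,r),\, n/4\}. \tag{6.5}$$

If we define $r_n := \max\{r \ge 0 : \mathbb{E}N(x,r) \le k(n)/2\}$, then

$$\begin{aligned}
\mathbb{P}(\hat r_{k(n)}(x) < r_n) &= \mathbb{P}(N(x,r_n) \ge k(n)) \\
&\le \mathbb{P}\bigl(N(x,r_n) - \mathbb{E}N(x,r_n) \ge k(n)/2\bigr) \\
&\le \mathbb{E}N(x,r_n) / (k(n)/2)^2 \\
&\le 2/k(n) \to 0
\end{aligned}$$

by Tshebyshev's inequality and (6.5). On the other hand, for any fixed
$\epsilon > 0$,

$$\begin{aligned}
\mathbb{P}(\hat r_{k(n)}(x) \ge \epsilon) &= \mathbb{P}(N(x,\epsilon) < k(n)) \\
&= \mathbb{P}\bigl(N(x,\epsilon) - \mathbb{E}N(x,\epsilon) < n(o(1) - \mathbb{P}(d(X,x) < \epsilon))\bigr) \\
&= O(1/n)
\end{aligned}$$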

according to (6.4) and (6.1). These considerations show that
$\hat r_{k(n)}(x) \to_p 0$, but $\hat r_{k(n)}(x) \ge r_n$ with asymptotic
probability one. Now we utilize that the process

$$r \mapsto \frac{\hat P_b(B(x,r))}{P_b(B(x,r))} - 1$$

is a zero mean reverse martingale on
$\{r \ge 0 : \mathbb{P}(d(X,x) \le r) > 0\} \supset (0,\infty)$, so that Doob's
inequality entails that

$$\mathbb{E} \sup_{r \ge r_n} \Bigl| \frac{\hat P_b(B(x,r))}{P_b(B(x,r))} - 1 \Bigr|^2 \le \frac{4}{N_b P_b(B(x,r_n))} \to 0;$$

see Shorack and Wellner [11, Sections 3.6 and A.10-11]. The latter
considerations imply the second part of (6.3). □

Proof of Theorem 5.1. The proof is by contradiction. To this end suppose that
there are $m \ge 2$ pairwise different parameters
$\xi(1), \xi(2), \dots, \xi(m) \in \Xi$ and nonzero real numbers
$\beta_1, \beta_2, \dots, \beta_m$ such that
$h := \sum_{i=1}^m \beta_i g_{\xi(i)}$ satisfies

$$\mathrm{Leb}_q(W) > 0 \quad\text{with}\quad W := \{x \in \mathcal{X} : h(x) = 0\}.$$

In case of $q = 1$, this entails that $W \subset \mathcal{X} = \mathcal{X}_1$
contains an accumulation point within $\mathcal{X}_1$, and the identity theorem
for analytic functions yields that $h = 0$ on $\mathcal{X}$. But this would be
a contradiction to $(Q_\xi)_{\xi\in\Xi}$ being identifiable.

In case of $q > 1$, by Fubini's theorem,

$$\mathrm{Leb}_q(W) = \int_{\mathcal{X}_1 \times \cdots \times \mathcal{X}_{q-1}} \mathrm{Leb}_1\{t : (x',t) \in W\}\, \mathrm{Leb}_{q-1}(dx') > 0,$$

whence $\mathrm{Leb}_1\{t : (x',t) \in W\} > 0$ for all $x'$ in a measurable set
$W' \subset \mathcal{X}_1 \times \cdots \times \mathcal{X}_{q-1}$ such that
$\mathrm{Leb}_{q-1}(W') > 0$. Hence the identity theorem for analytic
functions, applied to $t \mapsto h(x',t)$, implies that

$$W' \times \mathcal{X}_q \subset W.$$

Since $\mathrm{Leb}_{q-1}(W') > 0$, we may proceed inductively, considering for
$k = q - 1, q - 2, \dots, 1$ the functions
$t \mapsto h(x'', t, x_{k+1}, \dots, x_q)$ on $\mathcal{X}_k$. Eventually we
obtain $W = \mathcal{X}$, but this would be a contradiction to
$(Q_\xi)_{\xi\in\Xi}$ being identifiable. □

Proof of Lemma 3.4. For any permutation $\sigma$ of $(1, 2, \dots, L)$ let
$\mathbb{P}_\sigma(\cdot)$ and $\mathcal{L}_\sigma(\cdot)$ denote probabilities
and distributions in case of $P_b = Q_{\sigma(b)}$ for $b = 1, 2, \dots, L$.
By assumption (3.8), for any such $\sigma$ there is a set $A_\sigma$ of
potential training data sets $\mathcal{D}$ such that
$\mathbb{P}_\sigma(\mathcal{D} \in A_\sigma) = 1$ and

$$\int 1\{\pi_\theta(x,\mathcal{D}) \le \alpha\}\, Q_{\sigma(\theta)}(dx) \le \alpha \quad\text{whenever } \mathcal{D} \in A_\sigma.$$

Since the $L!$ distributions $\mathcal{L}_\sigma(\mathcal{D})$ are mutually
absolutely continuous, the intersection $A := \bigcap_\sigma A_\sigma$
satisfies $\mathbb{P}_\sigma(\mathcal{D} \in A) = 1$ for any permutation
$\sigma$. But then

$$\int 1\{\pi_\theta(x,\mathcal{D}) \le \alpha\}\, Q_b(dx) \le \alpha \quad\text{for all } b \in \Theta,\ \mathcal{D} \in A.$$

This implies that
$\mathbb{P}(\pi_\theta(X,\mathcal{D}) \le \alpha \mid Y = b, \mathcal{D}) \le \alpha$
almost surely for all $b \in \Theta$, provided that $(P_1, \dots, P_L)$ is a
permutation of $(Q_1, \dots, Q_L)$. □